Welcome to the Exploratory Data Analysis of the CRAN historical data set. As you may already know, CRAN is a network of servers around the world which store code and documentation for the R packages over time. As of writing this EDA, CRAN had just over 18,000 packages available in it’s repository.
CRAN Website
Heads or Tails has done a great job of grabbing historical data, cleaning it up and preparing it for us R enthusiasts. Read about the approach he followed in his blogpost.
Read through the initial setup in the 4 tabs below.
First, some I import some useful libraries and set some plotting defaults.
# Data Manipulation
library(dplyr)
library(tidyr)
library(readr)
library(skimr)
library(purrr)
library(stringr)
library(urltools)
library(magrittr)
# Plots
library(ggplot2)
library(naniar)
library(packcircles)
library(ggridges)
if(!require(streamgraph))
devtools::install_github("hrbrmstr/streamgraph", quiet = TRUE)
library(streamgraph)
library(patchwork)
# Tables
library(reactable)
# Settings
theme_set(theme_minimal(
base_size = 14,
base_family = "Menlo"))
theme_update(
plot.title.position = "plot"
)Let’s start be reading in the data. There are two CSV
files in this dataset. From his dataset
page:
cran_package_overview.csv: all R packages currently
available through CRAN, with (usually) 1 row per package…cran_package_history.csv: version history of virtually
all packages in the previous table...hist_dt <- read_csv(
"../input/r-package-history-on-cran/cran_package_history.csv",
col_types = cols(
package = col_character(),
version = col_character(),
date = col_date(format = "%Y-%m-%d"),
repository = col_character()
)
)
ov_dt <- read_csv(
"../input/r-package-history-on-cran/cran_package_overview.csv",
col_types = cols(
package = col_character(),
version = col_character(),
depends = col_character(),
imports = col_character(),
license = col_character(),
needs_compilation = col_logical(),
author = col_character(),
bug_reports = col_character(),
url = col_character(),
date_published = col_date(format = "%Y-%m-%d"),
description = col_character(),
title = col_character()
)
)
glimpse(hist_dt, 80)## Rows: 119,464
## Columns: 4
## $ package <chr> "A3", "A3", "A3", "AATtools", "ABACUS", "abbreviate", "abby…
## $ version <chr> "0.9.1", "0.9.2", "1.0.0", "0.0.1", "1.0.0", "0.1", "0.1", …
## $ date <date> 2013-02-07, 2013-03-26, 2015-08-16, 2020-06-14, 2019-09-20…
## $ repository <chr> "Archive", "Archive", "CRAN", "CRAN", "CRAN", "CRAN", "Arch…
glimpse(ov_dt, 80)## Rows: 18,388
## Columns: 12
## $ package <chr> "A3", "AATtools", "ABACUS", "abbreviate", "abbyyR", …
## $ version <chr> "1.0.0", "0.0.1", "1.0.0", "0.1", "0.5.5", "2.2.1", …
## $ depends <chr> "R (>= 2.15.0), xtable, pbapply", "R (>= 3.6.0)", "R…
## $ imports <chr> NA, "magrittr, dplyr, doParallel, foreach", "ggplot2…
## $ license <chr> "GPL (>= 2)", "GPL-3", "GPL-3", "GPL-3", "MIT + file…
## $ needs_compilation <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FAL…
## $ author <chr> "Scott Fortmann-Roe", "Sercan Kahveci [aut, cre]", "…
## $ bug_reports <chr> NA, "https://github.com/Spiritspeak/AATtools/issues"…
## $ url <chr> NA, NA, "https://shiny.abdn.ac.uk/Stats/apps/", "htt…
## $ date_published <date> 2015-08-16, 2020-06-14, 2019-09-20, 2021-12-14, 201…
## $ description <chr> "Supplies tools for tabulating and analyzing the res…
## $ title <chr> "Accurate, Adaptable, and Accessible Error Metrics f…
I love to take the first peek into a dataset with the amazing {skimr}
package. We can see that we have the right data types set for all the
columns, dates have been imported correctly.
We can see in the history data that the 1st package
reported on CRAN was in 22 years ago on 1998-02-25!
Furthermore, the overview tells us there’s a package
{pack} last published/updated on 2008-09-08.
While there’s no missing data in the history dataset,
there are a bunch of missing values in the overview
dataset. Let’s explore this a bit more.
skimr::skim(hist_dt)| Name | hist_dt |
| Number of rows | 119464 |
| Number of columns | 4 |
| _______________________ | |
| Column type frequency: | |
| character | 3 |
| Date | 1 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| package | 0 | 1 | 2 | 32 | 0 | 18372 | 0 |
| version | 0 | 1 | 3 | 15 | 0 | 10074 | 0 |
| repository | 0 | 1 | 4 | 7 | 0 | 2 | 0 |
Variable type: Date
| skim_variable | n_missing | complete_rate | min | max | median | n_unique |
|---|---|---|---|---|---|---|
| date | 0 | 1 | 1998-02-25 | 2022-07-17 | 2018-01-22 | 7119 |
skimr::skim(ov_dt)| Name | ov_dt |
| Number of rows | 18388 |
| Number of columns | 12 |
| _______________________ | |
| Column type frequency: | |
| character | 10 |
| Date | 1 |
| logical | 1 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| package | 0 | 1.00 | 2 | 32 | 0 | 18371 | 0 |
| version | 0 | 1.00 | 3 | 14 | 0 | 2112 | 0 |
| depends | 4862 | 0.74 | 2 | 218 | 0 | 4289 | 0 |
| imports | 4026 | 0.78 | 2 | 573 | 0 | 12041 | 0 |
| license | 0 | 1.00 | 3 | 54 | 0 | 159 | 0 |
| author | 0 | 1.00 | 4 | 4096 | 0 | 15652 | 0 |
| bug_reports | 10760 | 0.41 | 11 | 81 | 0 | 7578 | 0 |
| url | 8426 | 0.54 | 4 | 466 | 0 | 9548 | 0 |
| description | 0 | 1.00 | 5 | 7372 | 0 | 18368 | 0 |
| title | 0 | 1.00 | 5 | 210 | 0 | 18306 | 0 |
Variable type: Date
| skim_variable | n_missing | complete_rate | min | max | median | n_unique |
|---|---|---|---|---|---|---|
| date_published | 1 | 1 | 2008-09-08 | 2022-07-17 | 2021-03-22 | 2542 |
Variable type: logical
| skim_variable | n_missing | complete_rate | mean | count |
|---|---|---|---|---|
| needs_compilation | 0 | 1 | 0.24 | FAL: 13991, TRU: 4397 |
My favorite way of exploring missing data is to make it visible,
using Nick Tierney’s
amazing {naniar}
package. There are a few columns with missing data. Let’s look at these
more closely.
depends and imports have roughly a quarter
of the data as <NA>. These are packages which
roughly have no external dependencies. The difference between
the two can get a bit complex; best to learn about it in Hadley’s
chapter here.bug_reports and url are
missing. These don’t seem to be data issues as much as authors who don’t
have a place to issue bugs or website for their package
respectively.date_published has only 1 row with missing, which seems
like a data quality spill.ov_dt %>%
dplyr::arrange(date_published) %>%
vis_miss()Since this is an open ended exploration - unlike other EDA with the purpose of building a predictive model - before I continue to the plotting, I’d like to posit some questions which will guide the flow of further work. The first five questions are from Martin’s blog, with further questions which I think would be interesting to explore.
donedonedonedonedoneTo aid answering many of these, I first need to create a few new
features in the overview data set.
Read about the feature development in the tabs below. We go from
12 columns to 29 columns in the overview data set.
Per the R package section in Hadley’s book,
“an R package version is a sequence of at least two integers
separated by either . or -. For example,
1.0 and 0.9.1-10 are valid versions, but
1 and 1.0-devel are not”. Typically,
packages do follow the three number format of
<major>.<minor>.<patch>. I’m making an
assumption this is true, just to simplify things. I have a feeling it’ll
capture most of the cases.
This feature could help answer questions about version number progressions.
Adding 3 columns here…
split_versions <- function(dat) {
stopifnot("version" %in% names(dat))
dat %>%
separate(
version,
into =
c("major", "minor", "patch"),
sep = "\\.",
extra = "merge", # for versions like 1.0.3-3000, keep the '3-3000' together in the 3rd col
fill = "right",
remove = FALSE
)
}
ov_dt <- ov_dt %>% split_versions()
hist_dt <- hist_dt %>% split_versions()
glimpse(ov_dt, 100)## Rows: 18,388
## Columns: 15
## $ package <chr> "A3", "AATtools", "ABACUS", "abbreviate", "abbyyR", "abc", "abc.data", "…
## $ version <chr> "1.0.0", "0.0.1", "1.0.0", "0.1", "0.5.5", "2.2.1", "1.0", "0.9.0", "1.0…
## $ major <chr> "1", "0", "1", "0", "0", "2", "1", "0", "1", "1", "0", "0", "1", "1", "1…
## $ minor <chr> "0", "0", "0", "1", "5", "2", "0", "9", "0", "2", "3", "15", "2", "0", "…
## $ patch <chr> "0", "1", "0", NA, "5", "1", NA, "0", NA, "1", "0", "0", NA, "3", "3", N…
## $ depends <chr> "R (>= 2.15.0), xtable, pbapply", "R (>= 3.6.0)", "R (>= 3.1.0)", NA, "R…
## $ imports <chr> NA, "magrittr, dplyr, doParallel, foreach", "ggplot2 (>= 3.1.0), shiny (…
## $ license <chr> "GPL (>= 2)", "GPL-3", "GPL-3", "GPL-3", "MIT + file LICENSE", "GPL (>= …
## $ needs_compilation <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, TRUE, FALSE, TRU…
## $ author <chr> "Scott Fortmann-Roe", "Sercan Kahveci [aut, cre]", "Mintu Nath [aut, cre…
## $ bug_reports <chr> NA, "https://github.com/Spiritspeak/AATtools/issues", NA, NA, "http://gi…
## $ url <chr> NA, NA, "https://shiny.abdn.ac.uk/Stats/apps/", "https://github.com/sigb…
## $ date_published <date> 2015-08-16, 2020-06-14, 2019-09-20, 2021-12-14, 2019-06-25, 2022-05-19,…
## $ description <chr> "Supplies tools for tabulating and analyzing the results of predictive m…
## $ title <chr> "Accurate, Adaptable, and Accessible Error Metrics for Predictive\nModel…
For the last published version of the package, how many dependencies and/and imports does each package have? My hypothesis is that packages in the past relied on lesser dependencies since they were more likely than not written in base R. With the recent explosion of adoption of R, and the adoption of the tidyverse framework, more recent packages would have a larger set of dependencies.
Adding 2 columns here…
ov_dt <- ov_dt %>%
mutate(
# Dependencies
num_dep = purrr::map_int(
.x = depends,
.f = function(x){
x %>%
stringr::str_split(",", simplify = TRUE) %>%
length()
}
),
num_dep = ifelse(is.na(depends), 0, num_dep),
# Imports
num_imports = purrr::map_int(
.x = imports,
.f = function(x){
x %>%
stringr::str_split(",", simplify = TRUE) %>%
length()
}
),
num_imports = ifelse(is.na(imports), 0, num_imports)
)
glimpse(ov_dt, 100)## Rows: 18,388
## Columns: 17
## $ package <chr> "A3", "AATtools", "ABACUS", "abbreviate", "abbyyR", "abc", "abc.data", "…
## $ version <chr> "1.0.0", "0.0.1", "1.0.0", "0.1", "0.5.5", "2.2.1", "1.0", "0.9.0", "1.0…
## $ major <chr> "1", "0", "1", "0", "0", "2", "1", "0", "1", "1", "0", "0", "1", "1", "1…
## $ minor <chr> "0", "0", "0", "1", "5", "2", "0", "9", "0", "2", "3", "15", "2", "0", "…
## $ patch <chr> "0", "1", "0", NA, "5", "1", NA, "0", NA, "1", "0", "0", NA, "3", "3", N…
## $ depends <chr> "R (>= 2.15.0), xtable, pbapply", "R (>= 3.6.0)", "R (>= 3.1.0)", NA, "R…
## $ imports <chr> NA, "magrittr, dplyr, doParallel, foreach", "ggplot2 (>= 3.1.0), shiny (…
## $ license <chr> "GPL (>= 2)", "GPL-3", "GPL-3", "GPL-3", "MIT + file LICENSE", "GPL (>= …
## $ needs_compilation <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, TRUE, FALSE, TRU…
## $ author <chr> "Scott Fortmann-Roe", "Sercan Kahveci [aut, cre]", "Mintu Nath [aut, cre…
## $ bug_reports <chr> NA, "https://github.com/Spiritspeak/AATtools/issues", NA, NA, "http://gi…
## $ url <chr> NA, NA, "https://shiny.abdn.ac.uk/Stats/apps/", "https://github.com/sigb…
## $ date_published <date> 2015-08-16, 2020-06-14, 2019-09-20, 2021-12-14, 2019-06-25, 2022-05-19,…
## $ description <chr> "Supplies tools for tabulating and analyzing the results of predictive m…
## $ title <chr> "Accurate, Adaptable, and Accessible Error Metrics for Predictive\nModel…
## $ num_dep <dbl> 3, 1, 1, 0, 1, 6, 1, 1, 0, 1, 1, 0, 1, 0, 6, 5, 0, 0, 1, 0, 0, 1, 1, 1, …
## $ num_imports <dbl> 0, 4, 3, 0, 6, 0, 0, 3, 1, 1, 2, 4, 0, 1, 0, 0, 1, 0, 4, 4, 3, 2, 0, 8, …
Temporal features typically useful for aggregation downstream.
Adding 6 columns here…
hist_dt <- hist_dt %>%
mutate(
year = lubridate::year(date),
month = lubridate::month(date, label = TRUE),
day = lubridate::day(date),
wday = lubridate::wday(date, label = TRUE),
yr_mon = sprintf("%d-%s", year, month),
dt = lubridate::ym(paste0(year, "-", month))
)
ov_dt <- ov_dt %>%
filter(!is.na(date_published)) %>%
mutate(
year = lubridate::year(date_published),
month = lubridate::month(date_published, label = TRUE),
day = lubridate::day(date_published),
wday = lubridate::wday(date_published, label = TRUE),
yr_mon = sprintf("%d-%s", year, month),
dt = lubridate::ym(paste0(year, "-", month))
)
glimpse(ov_dt, 100)## Rows: 18,387
## Columns: 24
## $ package <chr> "A3", "AATtools", "ABACUS", "abbreviate", "abbyyR", "abc", "abc.data", "…
## $ version <chr> "1.0.0", "0.0.1", "1.0.0", "0.1", "0.5.5", "2.2.1", "1.0", "0.9.0", "1.0…
## $ major <chr> "1", "0", "1", "0", "0", "2", "1", "0", "1", "1", "0", "0", "1", "1", "1…
## $ minor <chr> "0", "0", "0", "1", "5", "2", "0", "9", "0", "2", "3", "15", "2", "0", "…
## $ patch <chr> "0", "1", "0", NA, "5", "1", NA, "0", NA, "1", "0", "0", NA, "3", "3", N…
## $ depends <chr> "R (>= 2.15.0), xtable, pbapply", "R (>= 3.6.0)", "R (>= 3.1.0)", NA, "R…
## $ imports <chr> NA, "magrittr, dplyr, doParallel, foreach", "ggplot2 (>= 3.1.0), shiny (…
## $ license <chr> "GPL (>= 2)", "GPL-3", "GPL-3", "GPL-3", "MIT + file LICENSE", "GPL (>= …
## $ needs_compilation <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, TRUE, FALSE, TRU…
## $ author <chr> "Scott Fortmann-Roe", "Sercan Kahveci [aut, cre]", "Mintu Nath [aut, cre…
## $ bug_reports <chr> NA, "https://github.com/Spiritspeak/AATtools/issues", NA, NA, "http://gi…
## $ url <chr> NA, NA, "https://shiny.abdn.ac.uk/Stats/apps/", "https://github.com/sigb…
## $ date_published <date> 2015-08-16, 2020-06-14, 2019-09-20, 2021-12-14, 2019-06-25, 2022-05-19,…
## $ description <chr> "Supplies tools for tabulating and analyzing the results of predictive m…
## $ title <chr> "Accurate, Adaptable, and Accessible Error Metrics for Predictive\nModel…
## $ num_dep <dbl> 3, 1, 1, 0, 1, 6, 1, 1, 0, 1, 1, 0, 1, 0, 6, 5, 0, 0, 1, 0, 0, 1, 1, 1, …
## $ num_imports <dbl> 0, 4, 3, 0, 6, 0, 0, 3, 1, 1, 2, 4, 0, 1, 0, 0, 1, 0, 4, 4, 3, 2, 0, 8, …
## $ num_authors <int> 1, 2, 2, 2, 2, 5, 5, 5, 5, 3, 3, 3, 4, 5, 4, 2, 2, 3, 14, 4, 3, 1, 6, 6,…
## $ year <dbl> 2015, 2020, 2019, 2021, 2019, 2022, 2015, 2016, 2019, 2017, 2022, 2017, …
## $ month <ord> Aug, Jun, Sep, Dec, Jun, May, May, Oct, Nov, Mar, May, Nov, Feb, May, Ju…
## $ day <int> 16, 14, 20, 14, 25, 19, 5, 20, 13, 13, 28, 6, 4, 28, 17, 3, 20, 30, 22, …
## $ wday <ord> Sun, Sun, Fri, Tue, Tue, Thu, Tue, Thu, Wed, Mon, Sat, Mon, Thu, Thu, Tu…
## $ yr_mon <chr> "2015-Aug", "2020-Jun", "2019-Sep", "2021-Dec", "2019-Jun", "2022-May", …
## $ dt <date> 2015-08-01, 2020-06-01, 2019-09-01, 2021-12-01, 2019-06-01, 2022-05-01,…
How long are the titles and description fields in the latest package submissions? Any interesting trends over time?
Adding 2 columns here…
ov_dt <- ov_dt %>%
mutate(
len_title = purrr::map_int(title, ~ stringr::str_count(.x, "\\w+")),
len_desc = purrr::map_int(description, ~ stringr::str_count(.x, "\\w+"))
)
glimpse(ov_dt, 100)## Rows: 18,387
## Columns: 26
## $ package <chr> "A3", "AATtools", "ABACUS", "abbreviate", "abbyyR", "abc", "abc.data", "…
## $ version <chr> "1.0.0", "0.0.1", "1.0.0", "0.1", "0.5.5", "2.2.1", "1.0", "0.9.0", "1.0…
## $ major <chr> "1", "0", "1", "0", "0", "2", "1", "0", "1", "1", "0", "0", "1", "1", "1…
## $ minor <chr> "0", "0", "0", "1", "5", "2", "0", "9", "0", "2", "3", "15", "2", "0", "…
## $ patch <chr> "0", "1", "0", NA, "5", "1", NA, "0", NA, "1", "0", "0", NA, "3", "3", N…
## $ depends <chr> "R (>= 2.15.0), xtable, pbapply", "R (>= 3.6.0)", "R (>= 3.1.0)", NA, "R…
## $ imports <chr> NA, "magrittr, dplyr, doParallel, foreach", "ggplot2 (>= 3.1.0), shiny (…
## $ license <chr> "GPL (>= 2)", "GPL-3", "GPL-3", "GPL-3", "MIT + file LICENSE", "GPL (>= …
## $ needs_compilation <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, TRUE, FALSE, TRU…
## $ author <chr> "Scott Fortmann-Roe", "Sercan Kahveci [aut, cre]", "Mintu Nath [aut, cre…
## $ bug_reports <chr> NA, "https://github.com/Spiritspeak/AATtools/issues", NA, NA, "http://gi…
## $ url <chr> NA, NA, "https://shiny.abdn.ac.uk/Stats/apps/", "https://github.com/sigb…
## $ date_published <date> 2015-08-16, 2020-06-14, 2019-09-20, 2021-12-14, 2019-06-25, 2022-05-19,…
## $ description <chr> "Supplies tools for tabulating and analyzing the results of predictive m…
## $ title <chr> "Accurate, Adaptable, and Accessible Error Metrics for Predictive\nModel…
## $ num_dep <dbl> 3, 1, 1, 0, 1, 6, 1, 1, 0, 1, 1, 0, 1, 0, 6, 5, 0, 0, 1, 0, 0, 1, 1, 1, …
## $ num_imports <dbl> 0, 4, 3, 0, 6, 0, 0, 3, 1, 1, 2, 4, 0, 1, 0, 0, 1, 0, 4, 4, 3, 2, 0, 8, …
## $ num_authors <int> 1, 2, 2, 2, 2, 5, 5, 5, 5, 3, 3, 3, 4, 5, 4, 2, 2, 3, 14, 4, 3, 1, 6, 6,…
## $ year <dbl> 2015, 2020, 2019, 2021, 2019, 2022, 2015, 2016, 2019, 2017, 2022, 2017, …
## $ month <ord> Aug, Jun, Sep, Dec, Jun, May, May, Oct, Nov, Mar, May, Nov, Feb, May, Ju…
## $ day <int> 16, 14, 20, 14, 25, 19, 5, 20, 13, 13, 28, 6, 4, 28, 17, 3, 20, 30, 22, …
## $ wday <ord> Sun, Sun, Fri, Tue, Tue, Thu, Tue, Thu, Wed, Mon, Sat, Mon, Thu, Thu, Tu…
## $ yr_mon <chr> "2015-Aug", "2020-Jun", "2019-Sep", "2021-Dec", "2019-Jun", "2022-May", …
## $ dt <date> 2015-08-01, 2020-06-01, 2019-09-01, 2021-12-01, 2019-06-01, 2022-05-01,…
## $ len_title <int> 9, 9, 8, 3, 8, 6, 8, 6, 9, 3, 5, 7, 7, 7, 4, 5, 5, 3, 4, 4, 5, 3, 8, 11,…
## $ len_desc <int> 28, 24, 40, 21, 52, 36, 11, 32, 41, 163, 19, 56, 33, 89, 12, 25, 88, 65,…
The raw dataset has 159 unique levels for the license
variable.
ov_dt %>%
count(license) %>%
reactable(compact = TRUE, defaultSorted = "n", defaultSortOrder = "desc")But many of being quite similar to each other, some binning is in
order to extract some patterns. Here, I use the case_when
to bin together similar licenses. (I’m no expert in these licenses. I’m
sure I’m taking some liberties in the grouping here).
Adding 1 column here…
ov_dt <- ov_dt %>%
mutate(
license_cleaned = case_when(
str_detect(license, "^GPL-3") ~ "GPL-3",
str_detect(license, "^GPL\\s\\([\\s\\d\\.<=>]*3") ~ "GPL-3",
str_detect(license, "^GPL-2") ~ "GPL-2",
str_detect(license, "^GPL\\s\\([\\s\\d\\.<=>]*2") ~ "GPL-2",
str_detect(license, "^AGPL") ~ "AGPL",
str_detect(license, "^LGPL") ~ "LGPL",
str_detect(license, "Apache") ~ "Apache",
str_detect(license, "BSD") ~ "BSD",
str_detect(license, "LGPL") ~ "LGPL",
str_detect(license, "MIT") ~ "MIT",
str_detect(license, "CC0") ~ "CC0",
license == "GPL" ~ "GPL",
TRUE ~ "Other"
# Left these out after some trials with plots below:
# str_detect(license, "GNU") ~ "GNU",
# str_detect(license, "MPL") ~ "MPL",
# str_detect(license, "Unlimited") ~ "Unlimited",
# str_detect(license, "^CC") ~ "CC",
)
)
glimpse(ov_dt, 100)## Rows: 18,387
## Columns: 27
## $ package <chr> "A3", "AATtools", "ABACUS", "abbreviate", "abbyyR", "abc", "abc.data", "…
## $ version <chr> "1.0.0", "0.0.1", "1.0.0", "0.1", "0.5.5", "2.2.1", "1.0", "0.9.0", "1.0…
## $ major <chr> "1", "0", "1", "0", "0", "2", "1", "0", "1", "1", "0", "0", "1", "1", "1…
## $ minor <chr> "0", "0", "0", "1", "5", "2", "0", "9", "0", "2", "3", "15", "2", "0", "…
## $ patch <chr> "0", "1", "0", NA, "5", "1", NA, "0", NA, "1", "0", "0", NA, "3", "3", N…
## $ depends <chr> "R (>= 2.15.0), xtable, pbapply", "R (>= 3.6.0)", "R (>= 3.1.0)", NA, "R…
## $ imports <chr> NA, "magrittr, dplyr, doParallel, foreach", "ggplot2 (>= 3.1.0), shiny (…
## $ license <chr> "GPL (>= 2)", "GPL-3", "GPL-3", "GPL-3", "MIT + file LICENSE", "GPL (>= …
## $ needs_compilation <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, TRUE, FALSE, TRU…
## $ author <chr> "Scott Fortmann-Roe", "Sercan Kahveci [aut, cre]", "Mintu Nath [aut, cre…
## $ bug_reports <chr> NA, "https://github.com/Spiritspeak/AATtools/issues", NA, NA, "http://gi…
## $ url <chr> NA, NA, "https://shiny.abdn.ac.uk/Stats/apps/", "https://github.com/sigb…
## $ date_published <date> 2015-08-16, 2020-06-14, 2019-09-20, 2021-12-14, 2019-06-25, 2022-05-19,…
## $ description <chr> "Supplies tools for tabulating and analyzing the results of predictive m…
## $ title <chr> "Accurate, Adaptable, and Accessible Error Metrics for Predictive\nModel…
## $ num_dep <dbl> 3, 1, 1, 0, 1, 6, 1, 1, 0, 1, 1, 0, 1, 0, 6, 5, 0, 0, 1, 0, 0, 1, 1, 1, …
## $ num_imports <dbl> 0, 4, 3, 0, 6, 0, 0, 3, 1, 1, 2, 4, 0, 1, 0, 0, 1, 0, 4, 4, 3, 2, 0, 8, …
## $ num_authors <int> 1, 2, 2, 2, 2, 5, 5, 5, 5, 3, 3, 3, 4, 5, 4, 2, 2, 3, 14, 4, 3, 1, 6, 6,…
## $ year <dbl> 2015, 2020, 2019, 2021, 2019, 2022, 2015, 2016, 2019, 2017, 2022, 2017, …
## $ month <ord> Aug, Jun, Sep, Dec, Jun, May, May, Oct, Nov, Mar, May, Nov, Feb, May, Ju…
## $ day <int> 16, 14, 20, 14, 25, 19, 5, 20, 13, 13, 28, 6, 4, 28, 17, 3, 20, 30, 22, …
## $ wday <ord> Sun, Sun, Fri, Tue, Tue, Thu, Tue, Thu, Wed, Mon, Sat, Mon, Thu, Thu, Tu…
## $ yr_mon <chr> "2015-Aug", "2020-Jun", "2019-Sep", "2021-Dec", "2019-Jun", "2022-May", …
## $ dt <date> 2015-08-01, 2020-06-01, 2019-09-01, 2021-12-01, 2019-06-01, 2022-05-01,…
## $ len_title <int> 9, 9, 8, 3, 8, 6, 8, 6, 9, 3, 5, 7, 7, 7, 4, 5, 5, 3, 4, 4, 5, 3, 8, 11,…
## $ len_desc <int> 28, 24, 40, 21, 52, 36, 11, 32, 41, 163, 19, 56, 33, 89, 12, 25, 88, 65,…
## $ license_cleaned <chr> "GPL-2", "GPL-3", "GPL-3", "GPL-3", "MIT", "GPL-3", "GPL-3", "GPL-3", "G…
Which domains do package authors typically use? My guess is GitHub rules them all, but is that true? Can we see any rise of other offerings like GitLab or BitBucket?
Adding 2 columns here…
ov_dt <- ov_dt %>%
mutate(url_domain = map_chr(url,
~ {
if (is.na(.x))
return(NA)
else
return(url_parse(.x)$domain)
}),
bug_domain = map_chr(bug_reports,
~ {
if (is.na(.x))
return(NA)
else
return(url_parse(.x)$domain)
}))
glimpse(ov_dt, 100)## Rows: 18,387
## Columns: 29
## $ package <chr> "A3", "AATtools", "ABACUS", "abbreviate", "abbyyR", "abc", "abc.data", "…
## $ version <chr> "1.0.0", "0.0.1", "1.0.0", "0.1", "0.5.5", "2.2.1", "1.0", "0.9.0", "1.0…
## $ major <chr> "1", "0", "1", "0", "0", "2", "1", "0", "1", "1", "0", "0", "1", "1", "1…
## $ minor <chr> "0", "0", "0", "1", "5", "2", "0", "9", "0", "2", "3", "15", "2", "0", "…
## $ patch <chr> "0", "1", "0", NA, "5", "1", NA, "0", NA, "1", "0", "0", NA, "3", "3", N…
## $ depends <chr> "R (>= 2.15.0), xtable, pbapply", "R (>= 3.6.0)", "R (>= 3.1.0)", NA, "R…
## $ imports <chr> NA, "magrittr, dplyr, doParallel, foreach", "ggplot2 (>= 3.1.0), shiny (…
## $ license <chr> "GPL (>= 2)", "GPL-3", "GPL-3", "GPL-3", "MIT + file LICENSE", "GPL (>= …
## $ needs_compilation <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, TRUE, FALSE, TRU…
## $ author <chr> "Scott Fortmann-Roe", "Sercan Kahveci [aut, cre]", "Mintu Nath [aut, cre…
## $ bug_reports <chr> NA, "https://github.com/Spiritspeak/AATtools/issues", NA, NA, "http://gi…
## $ url <chr> NA, NA, "https://shiny.abdn.ac.uk/Stats/apps/", "https://github.com/sigb…
## $ date_published <date> 2015-08-16, 2020-06-14, 2019-09-20, 2021-12-14, 2019-06-25, 2022-05-19,…
## $ description <chr> "Supplies tools for tabulating and analyzing the results of predictive m…
## $ title <chr> "Accurate, Adaptable, and Accessible Error Metrics for Predictive\nModel…
## $ num_dep <dbl> 3, 1, 1, 0, 1, 6, 1, 1, 0, 1, 1, 0, 1, 0, 6, 5, 0, 0, 1, 0, 0, 1, 1, 1, …
## $ num_imports <dbl> 0, 4, 3, 0, 6, 0, 0, 3, 1, 1, 2, 4, 0, 1, 0, 0, 1, 0, 4, 4, 3, 2, 0, 8, …
## $ num_authors <int> 1, 2, 2, 2, 2, 5, 5, 5, 5, 3, 3, 3, 4, 5, 4, 2, 2, 3, 14, 4, 3, 1, 6, 6,…
## $ year <dbl> 2015, 2020, 2019, 2021, 2019, 2022, 2015, 2016, 2019, 2017, 2022, 2017, …
## $ month <ord> Aug, Jun, Sep, Dec, Jun, May, May, Oct, Nov, Mar, May, Nov, Feb, May, Ju…
## $ day <int> 16, 14, 20, 14, 25, 19, 5, 20, 13, 13, 28, 6, 4, 28, 17, 3, 20, 30, 22, …
## $ wday <ord> Sun, Sun, Fri, Tue, Tue, Thu, Tue, Thu, Wed, Mon, Sat, Mon, Thu, Thu, Tu…
## $ yr_mon <chr> "2015-Aug", "2020-Jun", "2019-Sep", "2021-Dec", "2019-Jun", "2022-May", …
## $ dt <date> 2015-08-01, 2020-06-01, 2019-09-01, 2021-12-01, 2019-06-01, 2022-05-01,…
## $ len_title <int> 9, 9, 8, 3, 8, 6, 8, 6, 9, 3, 5, 7, 7, 7, 4, 5, 5, 3, 4, 4, 5, 3, 8, 11,…
## $ len_desc <int> 28, 24, 40, 21, 52, 36, 11, 32, 41, 163, 19, 56, 33, 89, 12, 25, 88, 65,…
## $ license_cleaned <chr> "GPL-2", "GPL-3", "GPL-3", "GPL-3", "MIT", "GPL-3", "GPL-3", "GPL-3", "G…
## $ url_domain <chr> NA, NA, "shiny.abdn.ac.uk", "github.com", "github.com", NA, NA, NA, NA, …
## $ bug_domain <chr> NA, "github.com", NA, NA, "github.com", NA, NA, NA, NA, NA, "github.com"…
Now that I have the data sets prepared and ready, it’s time for the fun part - being creative and creating some interesting visuals! Let’s attack those questions one at a time.
Q: How have the dependencies & imports changed over time?
Here’s a time series plot of the number of imports and dependencies since 2008. Since the values are integers, adding some jitter adds some much needed separation in the individual values.
The data supports my hypothesis that more recent packages would have a larger set of dependencies. We’re currently at a median of 6 imports. But look at the explosion of package dependencies in the recent past!
Also, what happens mid-2015? There’s a clear elbow in the trend right at that time. It doesn’t seem organic; did some popular package get released which others used as dependenies? did ÇRAN’s measurement system change?
plot_dotplot_ts <- function(dat,
title,
plotmean = FALSE,
dday = 90,
pcutoff = 0.95,
size = 1,
alpha = 0.06){
stopifnot(all(c("dt", "date_published", "y", "median_y") %in% names(dat)))
p95 <- quantile(dat$y, pcutoff)
plot_mean <- function(g, dat, plotmean) {
if (plotmean)
g <- g +
geom_smooth(aes(y = mean_y), color = "blue") +
annotate(
"text",
x = max(dat$date_published) + lubridate::ddays(dday),
y = max(dat$mean_y),
label = sprintf("Mean: %0.0f", max(dat$mean_y, na.rm = TRUE)),
vjust = 0.5,
hjust = 0,
color = "blue"
)
g
}
g <- dat %>%
ggplot(aes(x = dt)) +
geom_jitter(aes(y = y), alpha = alpha, size = size)
g <- g %>% plot_mean(dat, plotmean)
g +
geom_smooth(aes(y = median_y), color = "red") +
scale_x_date(
date_breaks = "1 year",
date_labels = "'%y",
minor_breaks = NULL,
expand = c(NA, 0.2)
) +
scale_y_continuous(minor_breaks = NULL, limits = c(NA, p95)) +
theme(
panel.grid = element_blank(),
axis.ticks.x.bottom = element_line(size = 0.8, colour = "gray"),
axis.ticks.y.left = element_line(size = 0.8, colour = "gray"),
plot.margin = margin(0, 50, 0, 50)
) +
labs(title = title,
caption = sprintf("Y axis clipped at %0.2f percentile", pcutoff),
x = NULL, y = NULL) +
coord_cartesian(clip = "off") +
annotate(
"text",
x = max(dat$date_published) + lubridate::ddays(dday),
y = max(dat$median_y),
label = sprintf("Med: %d", max(dat$median_y)),
vjust = 0.5,
hjust = 0,
color = "red"
)
}
ov_dt %>%
select(date_published, dt, y = num_imports) %>%
timetk::pad_by_time(date_published, .pad_value = 0) %>%
mutate(dt = lubridate::ym(paste0(lubridate::year(date_published), "-", lubridate::month(date_published)))) %>%
group_by(dt) %>%
mutate(median_y = median(y)) %>%
plot_dotplot_ts(
title = "How have package `imports` changed over time?",
dday = 90
) -> p1
ov_dt %>%
select(date_published, dt, y = num_dep) %>%
timetk::pad_by_time(date_published, .pad_value = 0) %>%
mutate(dt = lubridate::ym(paste0(lubridate::year(date_published), "-", lubridate::month(date_published)))) %>%
group_by(dt) %>%
mutate(median_y = median(y))%>%
plot_dotplot_ts(
title = "How have package `dependencies` changed over time?",
dday = 90
) -> p2
p1/p2Interestingly, titles have settled down at a median length of 7, while mean & median description lengths have been monotonically increasing since 2015. Looks like folks have been putting in more effort to describe their packages in an effort to attract R users.
ov_dt %>%
select(date_published, dt, y = len_title) %>%
timetk::pad_by_time(date_published, .pad_value = 0) %>%
mutate(dt = lubridate::ym(paste0(lubridate::year(date_published), "-", lubridate::month(date_published)))) %>%
group_by(dt) %>%
mutate(median_y = median(y))%>%
plot_dotplot_ts(
title = "Title Lengths",
dday = 90
) -> p1
ov_dt %>%
select(date_published, dt, y = len_desc) %>%
timetk::pad_by_time(date_published, .pad_value = 0) %>%
mutate(dt = lubridate::ym(paste0(lubridate::year(date_published), "-", lubridate::month(date_published)))) %>%
group_by(dt) %>%
mutate(median_y = median(y),
mean_y = mean(y))%>%
plot_dotplot_ts(
title = "Description Lengths",
dday = 80,
plotmean = TRUE
) -> p2
p1/p2Another interesting way to look at the same data is by ridge plots
using {ggridges}. It’s easy to see the large spread of
descriptions and how it’s been increasing over time.
deps <- ov_dt %>%
select(year,
`Description Length` = len_desc,
`Title Length` = len_title) %>%
arrange(-year) %>%
filter(!is.na(year), year > 2008) %>%
mutate(year = factor(year, levels = seq(2008, 2022)))
deps %>%
pivot_longer(-year) %>%
ggplot(aes(y = year, x = value, fill = name)) +
stat_density_ridges(
bandwidth = 4,
scale = .95,
quantile_lines = TRUE,
quantiles = 2,
alpha = 0.7,
rel_min_height = 0.01
) +
scale_x_continuous(limits = c(0, 200), expand = c(0, 0)) +
coord_cartesian(clip = "off") +
theme_ridges(center = TRUE) +
theme(legend.position = "top",
legend.title = element_blank()) +
labs(
title = "Distribution of Description & Title Lengths since 2010",
x = NULL,
y = NULL
)
## Licenses {.tabset}
Q: What license is most used? Has there been a change over time?
While barcharts are certainly the go-to to compare frequencies across categories, bubble charts are a fun way to visualize categories. GPL licenses are certainly the most popular license category being used today.
plot_bubbles <- function(dat,
.scale,
plot_radius,
bubble_radius,
textsize,
alpha,
color,
maxiter) {
.qty <- nrow(dat)
theta <- seq(0, 360, length.out = .qty + 1)
dat$x <- plot_radius * cos(theta * pi / 180)[-1]
dat$y <- plot_radius * sin(theta * pi / 180)[-1]
dat$n_scaled <- dat$n / .scale
xpack <- rep(dat$x, times = dat$n_scaled)
ypack <- rep(dat$y, times = dat$n_scaled)
coords <- tibble(
x = xpack + runif(length(xpack)),
y = ypack + runif(length(ypack)),
r = bubble_radius,
label = rep(dat$label, times = dat$n_scaled)
)
packed_coords <-
circleRepelLayout(coords, sizetype = "r", maxiter = maxiter)
packed_coords$layout %>%
ggplot(aes(x, y)) +
geom_point(aes(size = radius, color = coords$label), alpha = alpha) +
coord_equal() +
theme_minimal() +
theme(
legend.position = "none",
panel.grid = element_blank(),
axis.title = element_blank(),
axis.text = element_blank()
) +
geom_text(
aes(
x = x,
y = y,
label = label
),
color = color,
size = textsize,
fontface = "bold",
data = dat,
hjust = "center",
vjust = "center"
)
}
set.seed(31415)
ov_dt %>%
count(license_cleaned) %>%
top_n(8, n) %>%
mutate(label = sprintf(
"%s\n%s",
license_cleaned,
scales::label_comma(suffix = " Pkgs")(n)
)) %>%
arrange(runif(1:n())) %>%
plot_bubbles(
.scale = 50,
plot_radius = 8,
bubble_radius = .32,
textsize = 8,
alpha = 0.4,
color = "gray30",
maxiter = 1000
)While those bubble charts are nice to look at, a standard issue bar chart is the clear winner to visualize the relative distributions of the categories, especially the cliff event after the top 3 license types.
ov_dt %>%
count(license_cleaned) %>%
ggplot(aes(x = forcats::fct_reorder(license_cleaned, n), y = n, fill = license_cleaned)) +
geom_col() +
coord_flip() +
theme_minimal() +
guides(fill = "none") +
labs(x = NULL, y = NULL, title = "Which license is the most popular?")GPL-2, GPL-3 and MIT licenses seem to have the same adoption rate over time. Again, we see the sudden shifts in data points available after mid-2015.
ov_dt %>% count(license_cleaned) %>% arrange(-n) -> lic_n
ov_dt %>%
group_by(dt) %>%
count(license_cleaned) %>%
mutate(license_cleaned = factor(license_cleaned, levels = lic_n$license_cleaned)) %>%
ggplot(aes(x= dt, y = n, color = license_cleaned)) +
geom_smooth(span = 0.4, se = FALSE, show.legend = FALSE, size = .3) +
geom_jitter(alpha = 0.8) +
theme_light() +
theme(
legend.title = element_blank()
) +
labs(
x = NULL,
y = "Count",
title = "Adoption of Licenses"
)A striking way to visualize time-series data is using streamgraphs.
The {streamgraph}
package is an R API to access the D3 library to make these beautiful
plots. While the previous time-series plot is miminal & most
definitely allows a user to quickly extract a change in trend and
comparison between the groups, this streamgraph is just a joy to look
at.
Here, I’m plotting the series from 2015 to Jun-2022.
ov_dt %>%
group_by(dt) %>%
count(license_cleaned) %>%
ungroup() %>%
timetk::pad_by_time(.date_var = dt, .by = "month", .pad_value = 0) %>%
group_by(dt) %>%
mutate(license_cleaned = factor(license_cleaned, levels = lic_n$license_cleaned),
val = sum(n),
pc = n / val) -> pdat
pdat %>%
filter(dt > "2015-01-01", dt < "2022-06-30") %>%
streamgraph(license_cleaned, n, dt) %>%
sg_axis_x(1, "year", "%Y") %>%
sg_legend(show=TRUE, label="License: ") %>%
sg_fill_brewer("Spectral") %>%
sg_title("License Adoption since 2015")Q: Which domains do packages use for their website?
GitHub is definitely the preferred choice by a majority of the packages, followed by ROpenSci, and CRAN-RProject. Note: Here, I’m grouping all the remaining levels, where counds are < 50 in ‘Other’.
ov_dt %>%
filter(url_domain != "") %>%
mutate(url_domain = forcats::fct_lump_min(url_domain, 50)) %>%
count(url_domain) %>%
mutate(label = sprintf("%s\n%s", url_domain, scales::label_comma()(n))) %>%
arrange(runif(1:n())) %>%
plot_bubbles(
.scale = 50,
plot_radius = 5,
bubble_radius = .3,
textsize = 6,
alpha = 0.4,
color = "gray30",
maxiter = 1300
)A similar story when seen over time.
ov_dt %>%
filter(url_domain != "") %>%
mutate(url_domain = forcats::fct_lump_min(url_domain, 50)) %>%
group_by(dt) %>%
count(url_domain) %>%
ggplot(aes(x = dt, y = n, color = url_domain)) +
geom_smooth(
span = 0.4,
se = FALSE,
show.legend = FALSE,
size = .3
) +
geom_jitter(alpha = 0.8) +
theme_light() +
theme(legend.title = element_blank()) +
labs(x = NULL,
y = "Count",
title = "Domains used for Websites")Q: Which domain do packages use for reporting bugs / filling feature requests? How do these vary over time?
This one’s not even close. GitHub is the dominant player by a long shot.
ov_dt %>%
filter(bug_domain != "") %>%
mutate(bug_domain = forcats::fct_lump_min(bug_domain, 20)) %>%
count(bug_domain) %>%
mutate(label = sprintf("%s\n%s", bug_domain, scales::label_comma()(n))) %>%
arrange(runif(1:n())) %>%
plot_bubbles(
.scale = 35,
plot_radius = 5,
bubble_radius = .3,
textsize = 6,
alpha = 0.4,
color = "gray30",
maxiter = 1300
)And it has been from the start.
ov_dt %>%
filter(bug_domain != "") %>%
mutate(bug_domain = forcats::fct_lump_min(bug_domain, 20)) %>%
group_by(dt) %>%
count(bug_domain) %>%
ggplot(aes(x = dt, y = n, color = bug_domain)) +
geom_smooth(
span = 0.4,
se = FALSE,
show.legend = FALSE,
size = .3
) +
geom_jitter(alpha = 0.8) +
theme_light() +
theme(legend.title = element_blank()) +
labs(x = NULL,
y = "Count",
title = "Domains used for Bug Reporting")That’s it for now. I’ll keep updating this EDA as I find time and think of more questions to explore.
Thanks for reading!